BASS: multi-scale and multi-sample analysis enables accurate cell type clustering and spatial domain detection in spatial transcriptomic studies

Spatial transcriptomic studies are reaching single-cell spatial resolution, with data often collected from multiple tissue sections. Here, we present a computational method, BASS, that enables multi-scale and multi-sample analysis for single-cell resolution spatial transcriptomics. BASS performs cell type clustering at the single-cell scale and spatial domain detection at the tissue regional scale, with the two tasks carried out simultaneously within a Bayesian hierarchical modeling framework. We illustrate the benefits of BASS through comprehensive simulations and applications to three datasets. The substantial power gain brought by BASS allows us to reveal an accurate transcriptomic and cellular landscape in both the cortex and the hypothalamus.

Supplementary Information: The online version contains supplementary material available at 10.1186/s13059-022-02734-7.

where $i \sim i'$ denotes all neighboring pairs in the graph $G^{(l)}$; $I(z_i^{(l)} = z_{i'}^{(l)})$ is an indicator function that equals 1 if both the $i$th and $i'$th cells belong to the same spatial domain and equals 0 otherwise; $\beta$ is the interaction parameter that determines the extent of spatial domain similarity among neighboring locations; and $\mathcal{C}^{(l)}(\beta)$ is the normalizing constant, also known as the partition function, which ensures that the above probability mass function sums to one across all possible configurations of $z^{(l)}$.
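As a minimal sketch of the Potts prior described above, the following Python snippet evaluates only the unnormalized log-probability, that is, the interaction parameter times the number of neighboring pairs that share a spatial domain label. The function name `potts_log_unnorm`, the edge-list representation of the neighborhood graph, and the toy labels are assumptions made for illustration and are not taken from the BASS software; the normalizing constant is deliberately left out.

```python
import numpy as np

def potts_log_unnorm(z, edges, beta):
    """Unnormalized Potts log-probability: beta * (# neighboring pairs sharing a label)."""
    z = np.asarray(z)
    edges = np.asarray(edges)
    # Indicator I(z_i == z_i') evaluated for every neighboring pair (i, i').
    same = z[edges[:, 0]] == z[edges[:, 1]]
    return beta * int(same.sum())

# Toy chain graph 0-1-2-3 (hypothetical data).
edges = [(0, 1), (1, 2), (2, 3)]
smooth = potts_log_unnorm([0, 0, 0, 0], edges, beta=1.0)  # all 3 pairs match
rough = potts_log_unnorm([0, 1, 0, 1], edges, beta=1.0)   # no pair matches
```

A larger interaction parameter widens the gap between the smooth and the rough configuration, which is how the prior favors spatially contiguous domains.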

Prior specifications
We treat all the hyper-parameters in the above equations ($\mu_c$, $\Sigma$, $\pi_r$, $\beta$) as unknown and specify priors on them in order to infer them from the data at hand. Specifically, we specify a normal-gamma prior on $\mu_c$:

$$\mu_{jc} \mid d_j, \lambda_j \sim N(d_j, \lambda_j R_j^2), \qquad \lambda_j \sim \mathrm{Gamma}(\nu_1, \nu_2).$$

Above, we assume that in each feature dimension $j$, the mean parameter of cell type $c$, $\mu_{jc}$, follows a normal distribution with a feature-specific mean parameter $d_j$ and a variance parameter $\lambda_j R_j^2$, where $\lambda_j$ is a feature-specific scaling factor following a priori a gamma distribution with parameters $\nu_1$ and $\nu_2$, and $R_j$ is the range of the $j$th expression feature. To simplify the algebra, we further denote $d = (d_1, \ldots, d_J)^T$ and $\Lambda = \mathrm{diag}(\lambda_1 R_1^2, \ldots, \lambda_J R_J^2)$. The normal-gamma prior is applied as a shrinkage prior on the mean parameter of expression features. Intuitively, it pulls together the mean parameters of different cell types ($\mu_{j1}, \ldots, \mu_{jC}$) by shrinking $\lambda_j$ to a relatively small value if the $j$th feature is uninformative for distinguishing different cell types, thus yielding more precise estimates of the mean parameters. Following Malsiner-Walli et al. [75], we specify hyper-parameters $\nu_1$ and $\nu_2$ to be 0.5 to allow considerable shrinkage of the prior variance of the mean parameters.
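The shrinkage behavior of the normal-gamma prior can be illustrated with a short simulation. The sketch below draws a feature-specific scaling factor from a gamma distribution and then draws cell type means around a feature-specific prior mean with variance equal to the scaling factor times the squared feature range. The function name `sample_normal_gamma_means`, its arguments, and the toy inputs are hypothetical, and the shape/rate reading of the two gamma parameters is an assumption, since the text does not state the parameterization.

```python
import numpy as np

def sample_normal_gamma_means(d, R, n_types, nu1=0.5, nu2=0.5, rng=None):
    """Draw scaling factors lam[j] and cell type means mu[j, c] from the prior."""
    rng = np.random.default_rng() if rng is None else rng
    d = np.asarray(d, dtype=float)  # feature-specific prior means
    R = np.asarray(R, dtype=float)  # feature ranges
    # Assumption: the gamma prior is shape/rate, so NumPy's scale is 1 / nu2.
    lam = rng.gamma(shape=nu1, scale=1.0 / nu2, size=d.shape[0])
    sd = np.sqrt(lam) * R           # prior standard deviation per feature
    mu = d[:, None] + sd[:, None] * rng.standard_normal((d.shape[0], n_types))
    return lam, mu

rng = np.random.default_rng(0)
lam, mu = sample_normal_gamma_means(d=np.zeros(5), R=np.ones(5), n_types=3, rng=rng)
```

Features that receive a small scaling factor have cell type means that sit close to the shared prior mean, which mirrors the intuition that uninformative features are pulled together across cell types.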
For the other parameters ($\Sigma$, $\pi_r$, $\beta$), we specify the following priors:

$$\Sigma \sim \mathrm{IW}(\Psi_0, n_0),$$
$$\pi_r \sim \mathrm{Dir}(\alpha_0 \mathbf{1}_C^T),$$
$$\beta \sim \mathrm{Unif}(0, \beta_{\max}). \tag{10}$$

Above, we place an inverse Wishart prior on $\Sigma$ with $n_0$ degrees of freedom and a symmetric positive definite scale matrix $\Psi_0$ of size $J \times J$; a Dirichlet prior on $\pi_r$ with concentration parameter $\alpha_0$; and a uniform prior on $\beta$ with a lower bound of 0 and an upper bound of $\beta_{\max}$. We set $n_0$ to be 1 and choose $\Psi_0$ to provide a weak prior on the covariance matrix; set $\alpha_0$ to be 1 to encode equal prior probabilities for all possible cell type compositions; and set $\beta_{\max}$ to be 4 to represent the extreme case where the spatial domain boundaries are extremely smooth.

Posterior sampling algorithm
With the above model setup and prior specifications, we develop a Gibbs sampling algorithm in combination with a Metropolis-Hastings algorithm to infer all the parameters, including $c_i^{(l)}$, $z_i^{(l)}$, $\pi_r$, $\beta$, $\mu_c$, $\Sigma$, $d_j$, and $\lambda_j$.

Posterior sampling of cell type labels $c^{(l)}$:
The full conditional distribution of the cell type labels takes the form of a categorical distribution over the cell types, which gives the probability of being cell type $c$ for the $i$th cell on tissue section $l$. In this notation, $z_A^{(l)} = r$ is shorthand for $z_i^{(l)} = r$ for all $i \in A$.
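The categorical update for a single cell can be sketched as follows. The weights used here are an assumption derived from the model as described in this section (expression features that are multivariate normal with a cell-type-specific mean and a shared covariance, and cell type labels that are categorical with a domain-specific composition): each cell type is weighted by its proportion in the cell's current spatial domain times the normal density of the cell's features. The function name `sample_cell_type`, the array layout `pi[r, c]`, and the toy values are hypothetical and are not taken from the BASS software.

```python
import numpy as np

def sample_cell_type(x, r, pi, mu, Sigma, rng):
    """Draw one label from weights pi[r, c] * N(x | mu[c], Sigma)."""
    diff = mu - x                             # (C, J) deviations from each mean
    sol = np.linalg.solve(Sigma, diff.T).T    # Sigma^{-1} (mu_c - x) for every c
    maha = np.einsum("cj,cj->c", diff, sol)   # squared Mahalanobis distances
    # The log-determinant and 2*pi terms are shared by all cell types and cancel.
    logw = np.log(pi[r]) - 0.5 * maha
    w = np.exp(logw - logw.max())             # subtract the max for stability
    prob = w / w.sum()
    return int(rng.choice(len(prob), p=prob)), prob

rng = np.random.default_rng(1)
mu = np.array([[0.0, 0.0], [5.0, 5.0]])       # two cell types, two features
pi = np.array([[0.5, 0.5]])                   # one spatial domain
label, prob = sample_cell_type(np.array([0.1, -0.2]), 0, pi, mu, np.eye(2), rng)
```

Working on the log scale avoids underflow when a cell lies far from a cell type mean; a proportion of exactly zero would need separate handling, which this sketch omits.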

Posterior sampling of cell type compositions $\pi_r$:
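The full conditional distribution of the cell type composition vector, $\pi_r$, takes the following form:

$$P(\pi_r \mid \boldsymbol{c}, \boldsymbol{z}) \propto P(\boldsymbol{c} \mid \boldsymbol{z}, \pi_r)\, P(\pi_r) \propto \prod_{c=1}^{C} \pi_{c,r}^{\alpha_0 - 1 + \sum_{l} |\mathcal{I}^{(l)}(c, r)|},$$

where $\mathcal{I}^{(l)}(c, r)$ is the index set of cells with cell type label $c$ and spatial domain label $r$ in the $l$th tissue section and $|\cdot|$ is the cardinality of the corresponding index set. Therefore, the full conditional distribution of $\pi_r$ takes the form of a Dirichlet distribution, that is,

$$\pi_r \mid \boldsymbol{c}, \boldsymbol{z} \sim \mathrm{Dir}\Big(\alpha_0 + \sum_{l} |\mathcal{I}^{(l)}(1, r)|, \ldots, \alpha_0 + \sum_{l} |\mathcal{I}^{(l)}(C, r)|\Big).$$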
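The conjugate Dirichlet update amounts to counting, for each spatial domain, how many cells of each type it contains across all tissue sections and adding the prior concentration. The sketch below shows this under stated assumptions: the function name `sample_compositions`, the per-section list layout of the labels, and the toy labels are hypothetical and are not taken from the BASS software.

```python
import numpy as np

def sample_compositions(c, z, n_types, n_domains, alpha0=1.0, rng=None):
    """Count cells by (domain, type) over all sections, then draw each pi_r."""
    rng = np.random.default_rng() if rng is None else rng
    c = np.concatenate([np.asarray(s) for s in c])  # cell type labels, all sections
    z = np.concatenate([np.asarray(s) for s in z])  # domain labels, all sections
    counts = np.zeros((n_domains, n_types))
    np.add.at(counts, (z, c), 1)                    # counts[r, c] = cells of type c in domain r
    # Dirichlet posterior: prior concentration plus the observed counts.
    pi = np.vstack([rng.dirichlet(alpha0 + counts[r]) for r in range(n_domains)])
    return counts, pi

# Two toy tissue sections (hypothetical labels).
c = [[0, 0, 1], [1, 1]]
z = [[0, 0, 1], [1, 0]]
counts, pi = sample_compositions(c, z, n_types=2, n_domains=2,
                                 rng=np.random.default_rng(2))
```

Pooling the counts over sections is what lets the composition of a domain be informed by every tissue section in which that domain appears.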

Posterior sampling of $\beta$:
The full conditional distribution of $\beta$ takes the following form:

$$P(\beta \mid \boldsymbol{z}) \propto P(\beta) \prod_{l} P(z^{(l)} \mid \beta) \propto I(0 \le \beta \le \beta_{\max}) \prod_{l} \frac{1}{\mathcal{C}^{(l)}(\beta)} \exp\Big(\beta \sum_{i \sim i'} I(z_i^{(l)} = z_{i'}^{(l)})\Big).$$

The hyper-parameter $\beta$ in the Potts model is difficult to infer algorithmically because of the normalizing constant $\mathcal{C}^{(l)}(\beta)$. In particular, the computation of $\mathcal{C}^{(l)}(\beta)$ requires evaluating the probability mass function of the Potts model over all possible configurations of $z^{(l)}$ and is thus known to be NP-hard. Instead of computing the normalizing constant directly, we estimate the ratio of two normalizing constants by adapting the Swendsen-Wang algorithm, which allows us to sample $\beta$ from its conditional distribution through a Metropolis-Hastings algorithm. Specifically, we use a uniform distribution centered at the current value of $\beta$ as our proposal distribution, where the step size is set to be 0.1 by default. Then, the acceptance probability for the proposed value